We are a team of athletes and sports lovers, so we were interested in using data to research a problem in the world of professional sports. We decided to examine NFL injuries since particular injuries are common in football at the professional level, and their incidence and severity play a large role in determining a player’s return to play and a team’s success. Injuries don’t occur at random; we’d like to deduce some patterns underlying injury incidence to better understand the risks involved for players and teams.
In particular, we would like to examine which positions are most at risk for injuries, which injuries are most common, which teams have the most injuries (and whether these teams are consistently the same each year), and whether injury incidence has evolved over time (and in particular, whether concussion rates have decreased following the introduction of new helmet technology in 2017).
We were initially interested in examining 7 questions, stated below:
We ultimately decided to only consider questions 1-4, for a few reasons.
For question 5, we ran into a time constraint: because we built our own data for this project by web scraping pro-football-reference.com instead of using a pre-built dataset, finding and building the additional data that would be needed to answer our question about weather conditions would have cost time that was better spent answering questions 1-4 to the best of our abilities.
We decided against studying question 6 because, as we began to work more closely with our dataset, we realized that the structure of our injury data source did not lend itself to answering our question about injury duration. It was not possible to reliably discern the length of time each athlete was sidelined by one particular injury because, for example, it was common for athletes to have multiple injuries at once. In addition, there was no way to know the duration of injuries that were incurred at the end of the season and healed in the off-season.
As for question 7, we realized that this question could be subsumed by other questions under consideration. For example, in our analysis of question 1 we ran a logistic regression predicting binary injury occurrence using the covariates under study in question 7.
1) Data source
Our data source was pro-football-reference.com. This website publishes a report of injuries for each NFL team, with a column for every game played by the team and a row for each player who had been injured during the season. The website has injury reports from 2009 to 2021.
2) Web scraping method
A web scraping script was written using rvest that iterated over each year available on the website (2009-2021) and over each NFL team. We started the script at a webpage that had a link to each team’s main page. We then used the html_attr function to extract the href attribute of every link on the webpage, and used string matching to identify which of those links corresponded to team pages.
After each team’s main page was identified, we modified the URL to select the injury page for a given year. Because the injury pages had more information than just player injuries, the html_elements function was used to select only the table including player injury information. From that table, the html_attr function was used to select appropriate data, such as injury status, the text description, and whether the player had a “Did not play” specifier for each game.
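The link-extraction and URL-building steps can be sketched roughly as follows. This is a minimal illustration rather than the script we ran: the inline HTML stands in for the real index page, and the `/teams/xxx/` pattern, the `_injuries.htm` suffix, and the helper `injury_path()` are assumptions made for the example.

```r
library(rvest)

# Stand-in for the teams index page; the real page would be fetched with
# read_html(). The markup and hrefs below are made up for illustration.
index_page <- minimal_html('
  <a href="/teams/nwe/">New England Patriots</a>
  <a href="/teams/buf/">Buffalo Bills</a>
  <a href="/years/2021/">2021 season</a>
')

# Collect every href on the page, then keep only those that look like team pages.
all_links  <- html_attr(html_elements(index_page, "a"), "href")
team_links <- all_links[grepl("^/teams/[a-z]{3}/$", all_links)]

# Hypothetical helper: build the injury-page path for each team in one season.
injury_path <- function(team_link, year) {
  paste0(team_link, year, "_injuries.htm")
}

# One path per team per season, 2009 through 2021.
injury_paths <- unlist(lapply(2009:2021, function(y) injury_path(team_links, y)))
```

Each resulting path would then be read and narrowed to the injury table with html_elements before the individual cells are parsed.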
3) Data cleaning
a) Data shaping
In order to conduct analyses on player injuries, injury tables were transformed into a table with one row per injured player per year. To do this, we first selected only the regular season games from each table in order to give equivalent estimates for teams who made it to the playoffs. We then counted the number of games a player was listed as injured (the sum of non-blank entries in the table) and the number of games a player was listed as not playing. In order to describe injuries, we concatenated all unique injury descriptions for a given player into one string. We then had a dataset with one row per player per season, which allowed us to conduct season-wide analyses.
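As a rough sketch of this reshaping step, assume a scraped table with one row per player and one column per regular-season game. The column names, cell values, and the "DNP" marker format below are invented for illustration and may not match the scraped data exactly.

```r
# Toy injury report for one team-season: one row per player, one column per
# regular-season game. A blank cell means the player was not listed that week.
report <- data.frame(
  player = c("Player A", "Player B"),
  g1 = c("knee", ""),
  g2 = c("knee (DNP)", "ankle"),
  g3 = c("", "concussion"),
  stringsAsFactors = FALSE
)
game_cols <- c("g1", "g2", "g3")
cells <- as.matrix(report[, game_cols])

# Games listed as injured = number of non-blank cells in the row.
report$games_injured <- rowSums(cells != "")

# Games missed = cells carrying the (assumed) "DNP" marker.
report$games_missed <- rowSums(matrix(grepl("DNP", cells), nrow = nrow(cells)))

# Concatenate each player's unique injury descriptions into one string.
report$injuries <- apply(cells, 1, function(x) {
  x <- trimws(sub("\\(DNP\\)", "", x))
  paste(unique(x[x != ""]), collapse = " ")
})
```

Stacking the per-team, per-season results then gives the one-row-per-player-per-season dataset described above.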
Additional player demographic data (such as height, weight, age, and position) were provided by gridironai.com. Because this dataset had one row for every player for every week, duplicates were removed so that there was only one row per player per season.
Although the dataset from gridironai.com was the most complete of any dataset we identified on the internet (we also attempted to web scrape player information directly from pro-football-reference.com), it did not contain every player who had been listed as injured in the injury tables. We kept this in mind by carefully considering which types of join between the two datasets would introduce selection bias into our analyses.
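To make the join trade-off concrete, here is a small sketch with invented data (the column names are assumptions): an inner join silently drops injured players with no demographic record, a left join keeps them with missing covariates, and a full join also keeps players who never appear in the injury tables.

```r
# Invented toy data: player C is injured but has no demographic record;
# player D has a demographic record but never appears in the injury tables.
injuries <- data.frame(player = c("A", "B", "C"), year = 2019,
                       n_injuries = c(2, 1, 3), stringsAsFactors = FALSE)
demo <- data.frame(player = c("A", "B", "D"), year = 2019,
                   position_id = c("QB", "WR", "K"), stringsAsFactors = FALSE)

keys  <- c("player", "year")
inner <- merge(injuries, demo, by = keys)                # drops C and D
left  <- merge(injuries, demo, by = keys, all.x = TRUE)  # keeps C, position is NA
full  <- merge(injuries, demo, by = keys, all = TRUE)    # keeps C and D

# Injured players that the demographic data cannot describe.
unmatched <- left$player[is.na(left$position_id)]

# In the full join, a missing injury count marks a player with no listed injury.
full$injury <- as.integer(!is.na(full$n_injuries))
```

Comparing `nrow(inner)`, `nrow(left)`, and `nrow(full)` is a quick way to see how many players each choice would exclude.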
An important note is that for ease of analysis we excluded injuries sustained in the post-season (as some players play in the post-season and some do not) to remove a source of bias when counting injuries in a given season or otherwise analyzing injury incidence.
b) Injury classification
Once the data was scraped, there were issues with the free text in the injury column. In order to work with this data set to answer some of our primary questions, we needed to clean the data in such a way that the injuries could be easily modeled and analyzed. With so many injuries initially reported, as seen in the exploratory analysis below, we grouped the injuries into 8 main categories based on a part of the body.
These were: Head, Shoulder, Upper Torso, Lower Torso, Arm, Hand, Leg and Foot. Grouping this way made the distribution of injuries much easier to evaluate and draw conclusions from, while still keeping the injury data meaningful for each player. Once this decision was made, we cleaned up the free text by first removing any special characters separating injuries from each other (mostly “") and replacing them with a space. Many injuries then read as one large string, such as “knee arm concussion head”. To deal with this, we wrote a function that re-codes each specific injury into its appropriate category. For example, the above string would be recoded as “leg arm head head”, so we classified this player as having 1 leg injury, 1 arm injury and 2 head injuries.
Once the injuries were re-coded, we used the mutate() function to create count columns for each of the 8 body part injuries, summing up how many of those injuries each player had. We then used the group_by() and summarise() functions to obtain the total counts of each body part injury. This gave us a usable dataset with which to answer some of our primary questions.
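The recoding and counting logic can be illustrated with a short base-R sketch. The lookup table here is a hypothetical, heavily abbreviated stand-in for the full mapping, and the helper names are made up for the example; the analysis itself used mutate(), group_by(), and summarise().

```r
# Hypothetical lookup from specific injury terms to body-part groups;
# the real mapping would cover many more terms.
body_part <- c(knee = "leg", hamstring = "leg", ankle = "foot",
               concussion = "head", head = "head", arm = "arm")

parts <- c("head", "shoulder", "upper torso", "lower torso",
           "arm", "hand", "leg", "foot")

# Split a cleaned injury string into terms and map each term to its group.
recode_injuries <- function(injury_string) {
  terms <- strsplit(tolower(injury_string), "[^a-z]+")[[1]]
  terms <- terms[terms != ""]
  unname(body_part[terms])  # NA for terms missing from the lookup
}

# Count how many injuries fall in each of the eight groups for one player.
count_parts <- function(injury_string) {
  table(factor(recode_injuries(injury_string), levels = parts))
}

recoded <- recode_injuries("knee arm concussion head")
counts  <- count_parts("knee arm concussion head")

# Totals across players: add the per-player count tables together.
totals <- Reduce(`+`, lapply(c("knee arm concussion head", "ankle knee"), count_parts))
```

Splitting on non-letters is a simplification: it would break multi-word injury names, which a fuller version would need to handle before the lookup.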
We conducted exploratory data analysis on each of our 4 questions of interest separately. Below we will describe initial analysis and results for each question.
To answer this question, we used the web scraped injury and player demographic data. We merged these datasets and created factors where appropriate. We then created a binary injury variable that had a value of “1” for injured players and a “0” for non-injured players. The summary statistics below were computed for each position over all the years:
| position_id | mean | sd | Q1 | median | Q3 |
|---|---|---|---|---|---|
| DEF | 511.8 | 130.1 | 413 | 603 | 617 |
| K | 10.38 | 5.009 | 10 | 11 | 12 |
| OL | 192.6 | 40.31 | 180 | 199 | 210 |
| P | 6.769 | 3.7 | 4 | 8 | 10 |
| QB | 32.69 | 12.01 | 24 | 36 | 41 |
| RB | 95.69 | 24.68 | 81 | 107 | 115 |
| TE | 66.85 | 15.79 | 62 | 69 | 77 |
| WR | 130.3 | 25.03 | 116 | 131 | 149 |
We could interpret the mean value for defensive players (DEF) by stating that about 512 DEF players are injured per season. Although informative, this approach does not provide insight into the number of injuries that a defensive player could expect to incur in a given season. To obtain this information, we created a new variable that counts the total number of injuries for each player. We then grouped by position and year to find the total number of injuries for each position. This value was then divided by the number of players and then the number of seasons to produce the average injuries per player per year, broken down by position. The only complication here is that while our dataset contains all NFL players who were injured, we do not have a complete count of all players who were not injured, meaning that we do not have an accurate count of players who had zero injuries. To account for this, we will instead find the average number of injuries per player per season among players who are injured, or the expected number of injuries for a player conditioned on them being injured at least once. These results as well as other summary statistics are presented below.
| position_id | avg_total_injuries | SD | Q1 | Median | Q3 | Min | Max |
|---|---|---|---|---|---|---|---|
| DEF | 1.643 | 1.035 | 1 | 1 | 2 | 1 | 16 |
| K | 1.17 | 0.3774 | 1 | 1 | 1 | 1 | 2 |
| OL | 1.565 | 0.9145 | 1 | 1 | 2 | 1 | 9 |
| P | 1.205 | 0.4589 | 1 | 1 | 1 | 1 | 3 |
| QB | 1.689 | 1.405 | 1 | 1 | 2 | 1 | 21 |
| RB | 1.711 | 1.107 | 1 | 1 | 2 | 1 | 10 |
| TE | 1.623 | 1.031 | 1 | 1 | 2 | 1 | 13 |
| WR | 1.697 | 1.042 | 1 | 1 | 2 | 1 | 9 |
Now, we can state that a DEF player who is injured at least once can expect about 1.6 injuries per season, and that a quarterback (QB) who is injured at least once can expect about 1.7.
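The conditional average is simply a per-position mean taken over player-seasons that have at least one injury. A minimal sketch with invented counts (the data frame and its values are assumptions for illustration):

```r
# Invented player-season rows; every row has at least one injury because
# players without injuries never appear in the scraped injury tables.
player_seasons <- data.frame(
  position_id = c("QB", "QB", "K", "K", "K"),
  n_injuries  = c(1, 3, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Mean injuries per player-season by position, conditional on being injured.
cond_mean <- aggregate(n_injuries ~ position_id, data = player_seasons, FUN = mean)
```

Because the minimum possible count is 1, every conditional mean computed this way is at least 1.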
In addition to total injury counts over the total 2009-2021 period, we are also concerned with the evolution of these counts over time. We chart this below.
We chose to use a log base 10 scale on the y-axis because this makes it easier to see the lines towards the bottom of the graph, which all overlap on the original scale. When viewing the plot, it is important to note that some positions have more players on the field than others. For example, there are eleven defensive players on at a given time but only one quarterback. This suggests that scaling should be performed to account for the different group sizes. Scaling by the number of players in each position would essentially be the average number of injuries per player per position. As in the table above, we can only calculate the expected number of injuries among players who were injured at least once. The plot of these averages over time is presented below:
Note that none of these values are less than 1, because all values are conditioned on players being injured at least once.
Another important part of our analysis was checking for missing data. The table below shows that we had a small number of missing measurements for player height/weight/bmi. These values were removed prior to the analyses.
| injury | position_id | height_inches | weight_pounds | game_starter | age | bmi | year |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 13 | 11 | 0 | 0 | 13 | 0 |
After completing our initial EDA, we turned our attention to the classification task of predicting whether a player will be injured in a given season given their position, as well as other player information (e.g., height, weight, age, team). And, where possible, we sought to interpret odds ratios explaining the relationship between player position and injury risk. The three main classification algorithms we explored were logistic regression, k-Nearest Neighbors (kNN), and Random Forest. These methods were selected due to the categorical nature of our outcome (injured or not), and because of the varying degrees of flexibility afforded by these approaches. We first proceeded with an exploratory analysis to assess the suitability of logistic regression for this task.
From here, we opted to perform forward selection as a more systematic way of finding good features for the model. The process of forward selection begins by regressing the outcome variable on just the intercept, and then subsequently adds covariates to the model based on which obtains the lowest AIC value. The final results of this process appear below:
##
## Call:
## glm(formula = injury ~ 1, family = binomial(), data = players)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.20 -1.20 1.16 1.16 1.16
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0423 0.0123 3.45 0.00055 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36933 on 26649 degrees of freedom
## Residual deviance: 36933 on 26649 degrees of freedom
## AIC: 36935
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = injury ~ game_starter + year + position_id + age +
## weight_pounds + I(age^2), family = binomial(), data = players)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.016 -1.074 0.613 1.056 2.243
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.46e+02 7.27e+00 33.86 < 2e-16 ***
## game_starter 1.03e+00 2.77e-02 37.36 < 2e-16 ***
## year -1.23e-01 3.59e-03 -34.33 < 2e-16 ***
## position_idDEF 5.21e-02 5.54e-02 0.94 0.3471
## position_idK -1.01e+00 1.18e-01 -8.54 < 2e-16 ***
## position_idOL -1.48e-01 6.21e-02 -2.39 0.0170 *
## position_idP -1.33e+00 1.33e-01 -10.00 < 2e-16 ***
## position_idQB -7.61e-01 8.67e-02 -8.77 < 2e-16 ***
## position_idRB 1.90e-01 6.94e-02 2.75 0.0060 **
## position_idWR 1.30e-01 6.74e-02 1.94 0.0529 .
## age 1.45e-01 4.66e-02 3.10 0.0019 **
## weight_pounds -2.22e-03 3.84e-04 -5.79 7.2e-09 ***
## I(age^2) -1.92e-03 8.33e-04 -2.30 0.0212 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36933 on 26649 degrees of freedom
## Residual deviance: 33561 on 26637 degrees of freedom
## AIC: 33587
##
## Number of Fisher Scoring iterations: 4
We see that game starter (yes/no), year, position, age, age squared, and weight (lbs) were the most useful predictors. Age squared was added as a potential predictor because of the possibility that age and injury status have a quadratic relationship, where the risk of injury increases as players get older, but then decreases after a certain age since fewer people play football beyond 40. After completing the initial variable screening, we began fitting the machine learning models. The data were split into training and test sets, with the former receiving 70% of the data and the latter receiving the remaining 30%. The confusion matrix, accuracy, sensitivity, and specificity of each model are presented below. In addition, the k parameter for kNN was found by two-fold cross-validation. We chose k = 21 as this is the point where the accuracy began to level off.
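For readers who want to see the mechanics, the sketch below runs a 70/30 split and AIC-based forward selection on simulated data. Everything about the data is an assumption: `players_sim` is invented, only `game_starter` truly drives the outcome, and `noise` is an irrelevant covariate. Unlike the output above, which was fit on the full `players` data, this sketch runs the selection on the training split only.

```r
set.seed(1)
n <- 1000

# Simulated stand-in for the player-season data.
players_sim <- data.frame(
  game_starter = rbinom(n, 1, 0.5),
  age          = round(runif(n, 21, 38)),
  noise        = rnorm(n)
)
players_sim$injury <- rbinom(n, 1, plogis(-0.5 + 1.0 * players_sim$game_starter))

# 70/30 train/test split.
train_idx <- sample(n, size = round(0.7 * n))
train_set <- players_sim[train_idx, ]
test_set  <- players_sim[-train_idx, ]

# Forward selection: start from the intercept-only model and add, one at a
# time, the term that lowers AIC the most; stop when no addition helps.
null_fit <- glm(injury ~ 1, family = binomial(), data = train_set)
fwd_fit  <- step(null_fit,
                 scope = list(lower = ~ 1,
                              upper = ~ game_starter + age + I(age^2) + noise),
                 direction = "forward", trace = 0)

selected <- attr(terms(fwd_fit), "term.labels")
```

Setting `trace = 1` prints the AIC of every candidate term at each step, which is useful for checking why a covariate was or was not added.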
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2347 1325
## 1 1566 2757
##
## Accuracy : 0.638
## 95% CI : (0.628, 0.649)
## No Information Rate : 0.511
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.276
##
## Mcnemar's Test P-Value : 8.06e-06
##
## Sensitivity : 0.675
## Specificity : 0.600
## Pos Pred Value : 0.638
## Neg Pred Value : 0.639
## Prevalence : 0.511
## Detection Rate : 0.345
## Detection Prevalence : 0.541
## Balanced Accuracy : 0.638
##
## 'Positive' Class : 1
##
The logistic regression model correctly classified players in the testing set as being injured or not injured during a given season 63.8% of the time. The sensitivity of 0.675 indicates that among players who were actually injured, the model predicted 67.5% of them to be injured. The specificity of 0.600 indicates that among players who were not injured, the model correctly classified 60.0% of them as not injured. The specificity and sensitivity are reasonably balanced, but the sensitivity is slightly higher, meaning that the model is better at identifying players who will be injured than players who will not be injured.
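The headline metrics follow directly from the four counts in the confusion matrix printed above, with class 1 (injured) as the positive class. A short base-R check, using only those printed counts (the variable names are ours):

```r
# Confusion matrix from the output above: rows are predictions, columns are
# the reference (true) classes. matrix() fills column by column.
cm <- matrix(c(2347, 1566, 1325, 2757), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

tp <- cm["1", "1"]  # predicted injured, actually injured
tn <- cm["0", "0"]  # predicted not injured, actually not injured
fp <- cm["1", "0"]  # predicted injured, actually not injured
fn <- cm["0", "1"]  # predicted not injured, actually injured

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # share of injured players correctly flagged
specificity <- tn / (tn + fp)  # share of non-injured players correctly cleared
```

The same three formulas apply to the confusion matrices of the other two models.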
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2110 1555
## 1 1803 2527
##
## Accuracy : 0.58
## 95% CI : (0.569, 0.591)
## No Information Rate : 0.511
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.158
##
## Mcnemar's Test P-Value : 2.02e-05
##
## Sensitivity : 0.619
## Specificity : 0.539
## Pos Pred Value : 0.584
## Neg Pred Value : 0.576
## Prevalence : 0.511
## Detection Rate : 0.316
## Detection Prevalence : 0.542
## Balanced Accuracy : 0.579
##
## 'Positive' Class : 1
##
The kNN model correctly classified players in the testing set as being injured or not injured during a given season 58.0% of the time. The sensitivity of 0.619 indicates that among players who were actually injured, the model predicted 61.9% of them to be injured. The specificity of 0.539 indicates that among players who were not injured, the model correctly classified 53.9% of them as not injured. The sensitivity is again higher than the specificity, meaning that the model is better at identifying players who will be injured than players who will not be injured. The accuracy, sensitivity, and specificity are all lower in this model than in the previous model.
##
## Call:
## randomForest(formula = injury ~ position_id + age + I(age^2) + game_starter + weight_pounds + year, data = train_set)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 36%
## Confusion matrix:
## 0 1 class.error
## 0 5863 3267 0.358
## 1 3455 6070 0.363
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2504 1504
## 1 1409 2578
##
## Accuracy : 0.636
## 95% CI : (0.625, 0.646)
## No Information Rate : 0.511
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.271
##
## Mcnemar's Test P-Value : 0.0816
##
## Sensitivity : 0.632
## Specificity : 0.640
## Pos Pred Value : 0.647
## Neg Pred Value : 0.625
## Prevalence : 0.511
## Detection Rate : 0.322
## Detection Prevalence : 0.499
## Balanced Accuracy : 0.636
##
## 'Positive' Class : 1
##
The random forest model correctly classified players in the testing set as being injured or not injured during a given season 63.6% of the time. The sensitivity of 0.632 indicates that among players who were actually injured, the model predicted 63.2% of them to be injured. The specificity of 0.640 indicates that among players who were not injured, the model correctly classified 64.0% of them as not injured. This model has the most balanced sensitivity and specificity, meaning that it is about equally good at classifying injured and non-injured players.
From the results, we see that the logistic regression obtained the highest accuracy of 0.638, with the kNN and random forest models obtaining accuracies of 0.58 and 0.636, respectively. The ROC curves and corresponding AUC values are presented below.
auc(roc_logi) # calculate AUC for each ROC curve
## Area under the curve: 0.701
auc(roc_knn)
## Area under the curve: 0.608
auc(roc_rf)
## Area under the curve: 0.692
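As a package-free illustration of what an AUC value measures, the sketch below computes it from ranks: the AUC equals the probability that a randomly chosen injured player receives a higher predicted score than a randomly chosen non-injured player. The function `auc_rank()` and the toy scores are ours, not part of the analysis above.

```r
# Rank-based (Mann-Whitney) AUC for a binary label coded 0/1.
auc_rank <- function(score, label) {
  r     <- rank(score)  # tied scores receive average ranks
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc <- auc_rank(score = c(0.1, 0.4, 0.35, 0.8), label = c(0, 0, 1, 1))
```

An AUC of 0.5 corresponds to random ordering and 1 to perfect separation, so values near 0.70 indicate a modest ability to rank injured players above non-injured ones.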
We now will compare the three models based on Accuracy, Sensitivity, Specificity, and AUC:
| Model | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| Logistic | 0.638 | 0.675 | 0.6 | 0.701 |
| kNN (k = 21) | 0.58 | 0.619 | 0.539 | 0.608 |
| Random Forest | 0.636 | 0.632 | 0.64 | 0.692 |
We see that the logistic regression model has the highest accuracy, but the random forest has almost the same accuracy. We also see that the logistic regression model has the highest AUC. The k-nearest neighbors model is clearly the worst model under both of these metrics, so we will move on to look at sensitivity and specificity for the other two models. The random forest is slightly more balanced in terms of sensitivity/specificity. However, the logistic regression attains a higher sensitivity.
All things considered, the logistic regression has the highest accuracy and the highest AUC of all 3 models. It also provides us with clearly interpretable odds ratios for the risk of injury. For these reasons, we will choose logistic regression as the best model for our situation.
We are also interested in understanding which injuries are most common in the NFL; we begin by simply charting the distribution of injuries, using the very granular injury type classifications available from our data source.
We see here that looking at each specific injury does give us information, but after about the first half of the injury list, the graph is not very informative. We can nonetheless conclude from this graph that knee, ankle and hamstring injuries were the most common injuries among NFL players. After this initial EDA, injuries were grouped into more general body part categories: head, shoulder, upper torso, lower torso, arm, hand, leg and foot. This way we were able to analyze the distribution of general injuries more clearly and come to a meaningful conclusion. This can be seen in the RShiny app provided in the EDA files and on the project website.
We first consider the total injuries per year over the entire 2009-2021 period, as well as the max injuries that a team had in a single season over this period.
It is clear that the Houston Texans have the most total injuries over this period, as well as the maximum injuries in a single season. Most teams had between 600 and 700 injuries over this period, and the Texans’ 170 injuries in a single season (2011) is a full 20% higher than the second-highest single-season total during this time frame (142 for the Cleveland Browns in 2012).
Let’s take a look at the distribution of injury counts over each year in the 2009-2021 period, as well as the distribution of injury counts as shown in a boxplot.
From the final chart it appears that the Texans’ large number of injuries is concentrated in the 2011-2013 range. There is a sizable spread in the median injuries per season by team over the 2009-2020 seasons; the KC Chiefs had a median of 35 injuries per year compared to a whopping 85 for the Texans. So while the Texans’ very high number of total injuries over this period is due to a few large outliers in 2011-2013, their high median suggests that even in a typical season they have more injuries than most other teams.
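The per-team summaries behind these comparisons (total, single-season maximum, and median per season) can be computed from a table of team-season injury counts. A small sketch with invented teams and counts:

```r
# Invented team-season injury counts; Team A has one outlier season.
team_seasons <- data.frame(
  team     = rep(c("Team A", "Team B"), each = 3),
  year     = rep(2011:2013, times = 2),
  injuries = c(170, 90, 85, 30, 35, 40),
  stringsAsFactors = FALSE
)

total_by_team  <- tapply(team_seasons$injuries, team_seasons$team, sum)
max_by_team    <- tapply(team_seasons$injuries, team_seasons$team, max)
median_by_team <- tapply(team_seasons$injuries, team_seasons$team, median)
```

The median is the useful contrast here: an outlier season inflates a team's total and maximum but barely moves its median.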
Here, we will explore trends over time in NFL injuries.
First, we will create a simple plot of the number of injuries recorded each season.
It looks like there have been fewer injuries in recent years, but we can explore this data much further.
We can also look at the typical number of injuries players get in a season:
It is important to note that this data only includes players who were injured at least once, but it is helpful in understanding the typical distribution of the number of injuries per person. Among injured players, the median number of injuries per player was just one injury each season, but a few unlucky players experienced more than 10 injuries in a single season.
Concussions in the NFL have received significant attention in the last several years due to concerns regarding CTE and permanent brain damage. The NFL has worked to reduce the frequency of concussions among its players. Has there been a decline in the number of concussions each season since 2009?
While there is significant variation over time in this plot, it appears that there is a downward trend over time in the number of concussions each season.
Now that we have explored concussions more closely, let’s look at trends for all injury types.
Interestingly, we see similar trends across all eight injury types. Note that the y-axis is displayed on the log scale to improve readability by spreading out the lower lines on this plot.
We can visualize this information in a barplot as well, which is even easier to interpret:
The stacked barplot allows us to see the cumulative trends in injuries over time, as well as the breakdown by injury type.
While all of the previous plots show the same trend in injury counts over time, none of them provide any insight into the severity of each injury. Lastly, we want to explore the average number of games missed due to injury per player per season, as well as the average number of games injured per player per season, which includes players who play through injuries. These plots are shown below:
The plots actually show entirely different trends, highlighting the importance of exploring any data set extensively before drawing conclusions about the data. In the first plot, we see a spike in the number of games missed due to injury from 2016-2018, which corresponds with a sharp drop in raw injury counts in the other plots. It is possible that in these years players had fewer injuries than in prior years, but these injuries were more severe, leading to more games missed per injury. It is also possible that new rules force injured players to sit out of games even when they want to play through their injuries. The second plot depicts a more consistent trend in games played while injured with a steep drop-off in 2019. It is interesting to note that the trends in these two plots match reasonably well from 2016 to 2020, with the number of games injured only slightly higher than the number of games missed due to injury. Before 2016, it appears that there were many more games played while injured, because the number of games injured is much higher than the number of games missed. This may be due to changes in NFL rules that prevent injured players from injuring themselves further by continuing to play.
Many of our key conclusions are stated above in the exploratory analysis section but we will expand upon them and summarize results here.
1. Which positions are most at-risk for injuries?
As for assessing which positions were most at risk for injury, we were constrained by the position classification available at our data source; this source only broke out positions for offensive positions and grouped all defensive players together. Further, because of limitations of our data source, we could only count injuries among those players who were injured at least once per season over the 2009-2021 period in question. We found that among those players injured at least once, running backs had the highest average total injuries per player, per season at 1.71, whereas kickers injured at least once had only 1.17 injuries per kicker per season. Of course, in order to properly evaluate the injury risk associated with each position, we would have liked to have averaged in the players who were not injured at each position per season, but this was unfortunately not possible given the constraint of our data source.
However, we have another tool to more fully assess the injury risk at each position even though our data source did not include 100% of non-injured players: logistic regression. We built a logistic regression model to predict the binary outcome of injured vs. non-injured in a given season, using a carefully selected set of covariates, including a categorical variable for position. The regression output of this model allowed us to interpret odds ratios to better understand the injury risk associated with each position. Doing so, we again find that running back is the highest injury-risk position in the NFL: the odds of a running back sustaining at least one injury in a given NFL season are 1.21 [95% CI: 1.06-1.39] times those of a tight end (the reference category in our regression), on average, holding constant all other covariates in our model. On the other hand, we find that kicker is the “safest” position: the odds of a kicker sustaining at least one injury in a given NFL season are 0.364 [95% CI: 0.289-0.459] times those of a tight end on average, holding constant other covariates.
As for the classification task of assigning players to injured vs. non-injured in a given season: we find that logistic regression outperforms kNN and Random Forest, achieving the highest AUC (0.70), accuracy (0.638), and sensitivity (0.675).
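These odds ratios can be recovered from the coefficient table of the fitted model by exponentiating each estimate, and a Wald interval (estimate plus or minus about 1.96 standard errors, then exponentiated) closely reproduces the intervals quoted here. The helper `wald_or()` is ours, and this sketch assumes Wald intervals; a profile-likelihood interval would differ slightly.

```r
# Odds ratio and Wald confidence interval from a logistic regression
# coefficient (log-odds scale) and its standard error.
wald_or <- function(estimate, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(or    = exp(estimate),
    lower = exp(estimate - z * se),
    upper = exp(estimate + z * se))
}

# Estimates and standard errors read from the regression output above.
rb_or <- wald_or(estimate = 0.190, se = 0.0694)  # running back vs. tight end
k_or  <- wald_or(estimate = -1.01, se = 0.118)   # kicker vs. tight end
```

An interval that excludes 1, as both of these do, indicates that the position's odds of injury differ from the tight end reference at the chosen confidence level.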
2. Which injuries are most common in the NFL?
We now turn our attention to the question of which types of injuries most commonly afflict NFL players. After extensively cleaning and reclassifying our injury variable, we observe that leg injuries are by far the most common (the most common sub-injuries of the leg are knee and hamstring), followed by foot injuries, whereas arm injuries are least common. These trends hold for offense and defense separately. They are also quite consistent for each offensive position broken out separately. Our Shiny app allows for easy and interactive plotting of these results, and is available here.
3. Which teams have the most injuries, and are these teams consistently the same ones year over year?
A somewhat surprising result of this analysis is that over the 2009-2020 period, it did indeed appear as if one team typically had more injuries than the others: the Houston Texans. The Texans had the highest total injuries, the max injuries in a single season, and highest median injuries per season over the period in question. The injury per season distribution was quite similar for each team apart from the Texans and Browns (which had many more injuries than most) and the Steelers and Chiefs (which had many fewer injuries than most).
4. How has injury incidence evolved over time in the NFL?
A key finding of this analysis is that, according to our data source, injuries fell dramatically for nearly every team in 2016 compared to prior years, and injury levels have remained at this lower level since. We searched for a possible explanation for this fact (for example, relating to rule changes or reporting changes imposed by the NFL), and we also interrogated our data source for changes in its data collection methodology around this time, but were not able to identify a clear reason for this steep decline in 2016.
Another finding of this section was that, not surprisingly, the composition of injuries (by the injury types we defined) was relatively consistent between 2009 and 2021, with leg and foot injuries accounting for the plurality in each year.